Add MPS control daemon support to k8s device plugin #789
^ Updated the code to use the Type changes (thanks @KCSesh!) and responded to a few other comments. There is also a new change that performs the MIG and MPS incompatibility check in the template rendering; it echoes a warning that the two don't work together. This can easily be removed if NVIDIA lifts this incompatibility in a future release of their device plugin.
Add support for NVIDIA Multi-Process Service (MPS) control daemon, including service configuration and device plugin updates. Signed-off-by: Matthew Yeazel <[email protected]>
^ Updated to address comments around |
Issue number:
Related to: bottlerocket-os/bottlerocket#4673
Description of changes:
This builds the mps-control-daemon binary from the device plugin, which enables MPS support. We have to patch the hardcoded paths for Bottlerocket because the device plugin assumes it can write to /, which doesn't work on Bottlerocket.
This change also adds a new service to start this binary when settings request it. Otherwise it daemonizes `sleep infinity` to let systemd `try-restart` upon changing the settings for MPS.

The change should be safe to take without the bottlerocket-os/bottlerocket-kernel-kit#347 change or the upcoming settings change, but the daemon will not work without the kmod update and the settings being properly set.
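The start-up behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the actual drop-in from this PR: the function name `mps_exec_start` and the strategy value `mps` are assumptions made for the example; only `sleep infinity`, `none`, and the `mps-control-daemon` binary name come from the description.

```shell
#!/bin/sh
# Hypothetical helper: choose what the service should run.
# The "mps" value and the function name are assumptions for illustration.
mps_exec_start() {
  strategy="$1"
  if [ "$strategy" = "mps" ]; then
    # Settings request MPS: run the control daemon built from the device plugin.
    echo "mps-control-daemon"
  else
    # Otherwise stay alive as a placeholder so `systemctl try-restart`
    # has an active unit to restart when the settings change.
    echo "sleep infinity"
  fi
}

mps_exec_start "mps"
mps_exec_start "none"
```

Keeping a placeholder process running matters because `try-restart` only restarts units that are already active; an inactive unit would be left untouched when the settings change.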
Testing done:
Built images with the kernel change and the settings changes, and validated that a node comes up with MPS working if it is set in user data. Also validated that the services are restarted and that MPS can be enabled at runtime.
Setting in userdata for a g6.2xlarge which only has one GPU
`eksctl` config snippet for setting it at the beginning:

Results in a node reporting nvidia.com/gpu.shared:
Setting the MPS after boot
Start with a node with no configuration for MPS:
The node shows one GPU:
Then set MPS:
Now check the rest of the system:
And the node still shows the now-empty nvidia.com/gpu offering, plus a new shared one:
This is a known edge case and is similar to how timeslicing works. In order to avoid old resources, you'd need to start with the user-data approach.
Shifting to `rename-by-default=false` (`apiclient set settings.kubelet-device-plugins.nvidia.mps.rename-by-default=false`) will have the original nvidia.com/gpu resource instead:

And finally, setting sharing to `none` disables MPS:

And the resource goes back down to 1.
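As a rough summary of the resource naming observed in these steps, the sketch below maps the `rename-by-default` value to the resource name the node advertises while sharing is enabled. The helper name `advertised_resource` is hypothetical and exists only for this illustration; the two resource names are the ones shown in the testing above.

```shell
#!/bin/sh
# Hypothetical helper: which resource name the node advertises while
# GPU sharing is enabled, depending on rename-by-default.
advertised_resource() {
  rename_by_default="$1"
  if [ "$rename_by_default" = "true" ]; then
    # Shared replicas are exposed under a distinct name.
    echo "nvidia.com/gpu.shared"
  else
    # Shared replicas keep the original resource name.
    echo "nvidia.com/gpu"
  fi
}

advertised_resource "true"
advertised_resource "false"
```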
With the incompatibility checks in the template, you can see the messages preventing both MIG and MPS from running at the same time:
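As a rough illustration of the kind of check described (not the actual template code from this PR), a guard that warns when both features are requested might look like the sketch below. The function name, its boolean arguments, and the warning text are all assumptions made for the example.

```shell
#!/bin/sh
# Hypothetical guard: refuse to start MPS when MIG is also enabled.
# Argument names and the message wording are illustrative only.
check_mig_mps() {
  mig_enabled="$1"
  mps_enabled="$2"
  if [ "$mig_enabled" = "true" ] && [ "$mps_enabled" = "true" ]; then
    # Write the warning to stderr so it shows up in the service logs.
    echo "warning: MIG and MPS cannot be enabled together" >&2
    return 1
  fi
  return 0
}

# Only proceed when the combination is supported.
if check_mig_mps "false" "true"; then
  echo "MPS may start"
fi
```

Keeping the check in one small function makes it easy to drop if a future release of the NVIDIA device plugin removes the incompatibility.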
Terms of contribution:
By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.